1 Introduction

Alzheimer’s disease is a neurological disorder characterized by the degeneration of brain cells; it is the leading cause of dementia, accounting for roughly 60% of all cases. The disease manifests itself through decreased cognitive capabilities and a reduced ability to function independently (Breijyeh, Karaman, 2020). In the early stages, Alzheimer’s involves severe memory loss, apathy, and depression, while later stages are marked by communication problems, behavioral changes, and difficulties with walking and speaking (Alzheimer’s Association, 2024). Alzheimer’s disease is widespread and deadly. With over 7 million people in the United States living with Alzheimer’s, it is the fifth leading cause of death in people aged 65 and older (Alzheimer’s Association, 2024).

The prevalence and severity of Alzheimer’s disease make it a critical subject of study. Given the complex nature of neurodegenerative disease, early symptoms often go unnoticed by patients themselves, making Alzheimer’s difficult to diagnose until significant cognitive decline occurs (Breijyeh, Karaman, 2020). Because the disease is difficult both for medical professionals to recognize and for patients to notice, it is often a deadly threat. Thus, the primary objective of this research project is to identify the key determinants of Alzheimer’s disease. Uncovering such factors is vital, as no effective medicine has been found yet. Therefore, early detection of Alzheimer’s disease and stronger preventive measures are considered crucial in minimizing the impact of dementia.

Data science and predictive modeling can support the goal of alleviating and potentially preventing Alzheimer’s disease. More specifically, applying predictive models can help answer the key research question, "What are the determinants and potential predictors of Alzheimer’s disease?", by identifying the key features associated with dementia.

The Alzheimer’s Disease Dataset was obtained from Kaggle. Its author, Rabie El Kharoua, created the dataset to offer extensive insight into the factors associated with Alzheimer’s disease. The variables include the diagnosis of Alzheimer’s disease, demographic data of patients (age, gender, ethnicity, education level), lifestyle factors (e.g. BMI, smoking habits, alcohol consumption), medical history (e.g. cardiovascular disease, diabetes), clinical measurements (e.g. cholesterol levels), cognitive and functional assessments (e.g. MMSE, ADL), and symptoms (e.g. confusion, personality changes, forgetfulness). These variables are highly relevant to this study because they are commonly associated with and tracked in people with Alzheimer’s (Breijyeh, Karaman, 2020), and so can be used in this research to determine which of them best predict Alzheimer’s.

This study addresses the research question in several phases. After data preprocessing (including handling missing data and identifying outliers), exploratory data analysis will be conducted to gain insights into the dataset’s main characteristics. Then, predictive models such as logistic regression, the Naive Bayes classifier, and k-nearest neighbors (kNN) will be developed to identify the key features associated with Alzheimer’s disease. The findings from the data analysis will contribute to enhancing early detection and to developing recommendations for mitigating the impact of Alzheimer’s disease.
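
As a rough illustration of the modeling phase, the sketch below fits a logistic regression to simulated data and evaluates it on held-out rows. Everything in it is an assumption made for illustration: the simulated `Age` and `MMSE` values, the coefficients of the data-generating process, the 80/20 split, and the 0.5 threshold are hypothetical and are not taken from the dataset or from the analysis in this report.

```r
set.seed(42)  # reproducible simulated data

# Simulated data for illustration only; not drawn from the Alzheimer's dataset.
n <- 1000
sim <- data.frame(
  Age = runif(n, min = 60, max = 90),
  MMSE = runif(n, min = 0, max = 30)
)

# Assumed data-generating process: higher Age raises and higher MMSE lowers
# the log-odds of a positive diagnosis.
log_odds <- -4 + 0.08 * sim$Age - 0.15 * sim$MMSE
sim$Diagnosis <- rbinom(n, size = 1, prob = plogis(log_odds))

# Hold out 200 rows (20%) for evaluation.
n_test <- 200
test_idx <- sample(n, size = n_test)
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

# Logistic regression; coefficients are on the log-odds scale.
fit <- glm(Diagnosis ~ Age + MMSE, data = train, family = binomial)
print(summary(fit)$coefficients)

# Predicted probabilities on the held-out rows, thresholded at 0.5.
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy <- mean(pred_class == test$Diagnosis)
print(accuracy)
```

In a sketch like this, the sign and size of each coefficient indicate the direction and strength of the association with the outcome, which is how a fitted model can point to candidate predictors.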

To support the analysis, the following R libraries are used throughout the research:

library(ggplot2)   # plotting
library(plyr)      # split-apply-combine data manipulation
library(Hmisc)     # miscellaneous data analysis utilities
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(naniar)    # missing-data summaries and plots (gg_miss_var)
library(liver)     # data science helper functions
## 
## Attaching package: 'liver'
## The following object is masked from 'package:base':
## 
##     transform
library(pROC)      # ROC curves and AUC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(naivebayes) # Naive Bayes classifier
## naivebayes 1.0.0 loaded
## For more information please visit:
## https://majkamichal.github.io/naivebayes/
library(psych)     # descriptive statistics
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

2 The Nature of the Research

Our research adopts an exploratory approach to understanding the predictors of Alzheimer’s diagnosis. The aim is not only to assess the immediate relationships between the candidate predictors and Alzheimer’s diagnosis but also to uncover patterns that could inform future research, diagnosis, and potential preventive measures.

Given the complexity of Alzheimer’s disease and the multifaceted nature of its risk factors, we will formulate hypotheses at a group level, rather than focusing on individual predictors. The hypotheses will be exploratory in nature, allowing for an investigation into how demographic, lifestyle, and medical factors, among others, might collectively contribute to the onset of Alzheimer’s. By grouping predictors, we hope to identify clusters of risk factors or individual factors that might provide insight into the mechanisms behind Alzheimer’s development. By understanding which factors are associated with an increased or decreased likelihood of Alzheimer’s diagnosis, we can potentially improve early detection strategies and guide interventions aimed at reducing risk.

3 Theoretical Framework

Alzheimer’s disease is widely regarded as a multifactorial condition, meaning that its risk is influenced by a combination of different factors, including genetic, environmental and lifestyle factors (Breijyeh, Karaman, 2020). Despite extensive research, mechanisms that cause the pathological changes related to Alzheimer’s disease remain unknown (Breijyeh, Karaman, 2020). Since the underlying mechanism remains elusive, focusing on modifiable risk factors like lifestyle is crucial for mitigating disease progression in the absence of a cure.

3.1 Demographic determinants of Alzheimer’s disease

Aging is one of the most prominent demographic risk factors for Alzheimer’s disease. Research shows that it is highly uncommon for young individuals to develop Alzheimer’s, with the vast majority of cases occurring in individuals 65 and above (Breijyeh, Karaman, 2020). This makes age a crucial determinant in understanding the onset and progression of Alzheimer’s, highlighting the need for targeted screening and preventive measures in older populations. Thus, the following hypothesis is constructed.

H1: Older age increases the likelihood of Alzheimer’s diagnosis.

3.2 Lifestyle determinants of Alzheimer’s disease

Lifestyle factors also contribute to Alzheimer’s disease risk. For example, exposure to air pollution has been linked to increased production of peptides commonly associated with decreased cognitive function (Breijyeh, Karaman, 2020). Moreover, saturated fatty acids and high-calorie diets have been found to lead to an increased incidence of Alzheimer’s, with malnutrition also exacerbating the condition (Breijyeh, Karaman, 2020). This highlights the significance of lifestyle choices in the progression of disease, and so our hypothesis is as follows.

H2: A healthier lifestyle, characterized by a lower BMI, non-smoking status, low alcohol consumption, regular physical activity, good diet quality, and better sleep quality, is associated with a lower likelihood of Alzheimer’s diagnosis.

3.3 Medical history determinants of Alzheimer’s disease

Medical history is another important risk factor for Alzheimer’s. Cardiovascular diseases, diabetes, and obesity have all been linked to Alzheimer’s risk. For example, while obesity does not directly cause Alzheimer’s, it puts individuals at higher risk for cancer or cardiovascular disease, indirectly increasing the risk of Alzheimer’s disease (Breijyeh, Karaman, 2020). The factors within medical history that can be controlled (i.e., obesity) are especially important because they offer avenues for intervention. Thus, the following hypothesis will be tested.

H3: A history of chronic health conditions such as cardiovascular disease, diabetes, depression, hypertension, and head injury, as well as a family history of Alzheimer’s, increases the likelihood of an Alzheimer’s diagnosis.

3.4 Clinical Measurements and Cognitive and Functional Assessments as determinants of Alzheimer’s disease

Clinical measurements such as blood pressure and cholesterol levels, together with cognitive and functional assessments, provide additional insights into Alzheimer’s risk. Cholesterol, for instance, can contribute to the development of Alzheimer’s by accumulating in the brain tissue (Breijyeh, Karaman, 2020). Furthermore, people who have Alzheimer’s or are suspected of having it often undergo mental and physical examinations, such as the MMSE and functional assessments, to evaluate their cognitive and functional capabilities. Patients typically perform worse on these assessments than those without Alzheimer’s, reflecting the progressive nature of the disease (Guk-Hee et al., 2004).

H4: Poor cardiovascular health, indicated by high blood pressure and unfavorable cholesterol levels (high total cholesterol, high LDL, low HDL, and high triglycerides), is associated with a higher likelihood of Alzheimer’s diagnosis.

H5: Lower cognitive and functional scores (e.g., MMSE, functional assessment) and the presence of memory complaints or behavioral problems are associated with a higher likelihood of Alzheimer’s diagnosis.

3.5 Symptoms as determinants of Alzheimer’s disease

Alzheimer’s symptoms, manifesting in behavioral issues, memory loss, and disorientation, are often noticed by others before the patients are aware of their cognitive decline. Therefore, early symptom identification is critical, as it prompts further diagnostic evaluations (Breijyeh & Karaman, 2020).

H6: The presence of cognitive and behavioral symptoms such as confusion, disorientation, personality changes, difficulty completing tasks, and forgetfulness is positively associated with Alzheimer’s diagnosis.

4 Data Preparation

We import the CSV file into R as follows.

setwd("/Users/vanessazyto/Desktop")  # folder that contains the CSV file
data <- read.csv("alzheimer.csv", header = TRUE, sep = ";")  # semicolon-delimited file

To get an overview of the dataset in R, we use the str() function as follows:

str(data)
## 'data.frame':    2149 obs. of  35 variables:
##  $ PatientID                : int  4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 ...
##  $ Age                      : int  73 89 73 74 89 86 68 75 72 87 ...
##  $ Gender                   : int  0 0 0 1 0 1 0 0 1 0 ...
##  $ Ethnicity                : int  0 0 3 0 0 1 3 0 1 0 ...
##  $ EducationLevel           : int  2 0 1 1 0 1 2 1 0 0 ...
##  $ BMI                      : num  22.9 26.8 17.8 33.8 20.7 ...
##  $ Smoking                  : int  0 0 0 1 0 0 1 0 0 1 ...
##  $ AlcoholConsumption       : num  13.3 4.54 19.56 12.21 18.45 ...
##  $ PhysicalActivity         : num  6.33 7.62 7.84 8.43 6.31 ...
##  $ DietQuality              : num  1.347 0.519 1.826 7.436 0.795 ...
##  $ SleepQuality             : num  9.03 7.15 9.67 8.39 5.6 ...
##  $ FamilyHistoryAlzheimers  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ CardiovascularDisease    : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Diabetes                 : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Depression               : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ HeadInjury               : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Hypertension             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ SystolicBP               : int  142 115 99 118 94 168 143 117 117 130 ...
##  $ DiastolicBP              : int  72 64 116 115 117 62 88 63 119 78 ...
##  $ CholesterolTotal         : num  242 231 284 160 238 ...
##  $ CholesterolLDL           : num  56.2 193.4 153.3 65.4 92.9 ...
##  $ CholesterolHDL           : num  33.7 79 69.8 68.5 56.9 ...
##  $ CholesterolTriglycerides : num  162.2 294.6 83.6 277.6 291.2 ...
##  $ MMSE                     : num  21.46 20.61 7.36 13.99 13.52 ...
##  $ FunctionalAssessment     : num  6.52 7.12 5.9 8.97 6.05 ...
##  $ MemoryComplaints         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BehavioralProblems       : int  0 0 0 1 0 0 0 0 1 1 ...
##  $ ADL                      : num  1.7259 2.5924 7.1195 6.4812 0.0147 ...
##  $ Confusion                : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Disorientation           : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ PersonalityChanges       : int  0 0 0 0 1 0 0 0 1 0 ...
##  $ DifficultyCompletingTasks: int  1 0 1 0 1 0 0 0 0 0 ...
##  $ Forgetfulness            : int  0 1 0 0 0 0 1 1 0 0 ...
##  $ Diagnosis                : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ DoctorInCharge           : chr  "XXXConfid" "XXXConfid" "XXXConfid" "XXXConfid" ...

From the output of str(data), it can be seen that the dataset contains 2149 observations of 35 variables. Because the variable PatientID only identifies each observation, it will not be included in the statistical and exploratory data analyses; we keep it solely as an identifier. The variable DoctorInCharge contains confidential information about the doctor responsible for each patient and does not contribute to the statistical analysis, so it will be excluded from our analysis too. Of the 33 variables that will be actively used in the model, Diagnosis will be the target variable for the main hypothesis of the research. The definition and meaning of each variable are shown as follows:

With the interpretation of each variable in mind, the first rows of the data frame are displayed with the head() function to give a clearer picture of the data.

head(data)
##   PatientID Age Gender Ethnicity EducationLevel      BMI Smoking
## 1      4751  73      0         0              2 22.92775       0
## 2      4752  89      0         0              0 26.82768       0
## 3      4753  73      0         3              1 17.79588       0
## 4      4754  74      1         0              1 33.80082       1
## 5      4755  89      0         0              0 20.71697       0
## 6      4756  86      1         1              1 30.62689       0
##   AlcoholConsumption PhysicalActivity DietQuality SleepQuality
## 1          13.297218        6.3271125   1.3472143     9.025679
## 2           4.542524        7.6198845   0.5187671     7.151293
## 3          19.555085        7.8449878   1.8263347     9.673574
## 4          12.209266        8.4280014   7.4356041     8.392554
## 5          18.454356        6.3104607   0.7954975     5.597238
## 6           4.140144        0.2110616   1.5849220     7.261953
##   FamilyHistoryAlzheimers CardiovascularDisease Diabetes Depression HeadInjury
## 1                       0                     0        1          1          0
## 2                       0                     0        0          0          0
## 3                       1                     0        0          0          0
## 4                       0                     0        0          0          0
## 5                       0                     0        0          0          0
## 6                       0                     0        1          0          0
##   Hypertension SystolicBP DiastolicBP CholesterolTotal CholesterolLDL
## 1            0        142          72         242.3668       56.15090
## 2            0        115          64         231.1626      193.40800
## 3            0         99         116         284.1819      153.32276
## 4            0        118         115         159.5822       65.36664
## 5            0         94         117         237.6022       92.86970
## 6            0        168          62         280.7125      198.33463
##   CholesterolHDL CholesterolTriglycerides      MMSE FunctionalAssessment
## 1       33.68256                162.18914 21.463532             6.518877
## 2       79.02848                294.63091 20.613267             7.118696
## 3       69.77229                 83.63832  7.356249             5.895077
## 4       68.45749                277.57736 13.991127             8.965106
## 5       56.87430                291.19878 13.517609             6.045039
## 6       79.08050                263.94365 27.517529             5.510144
##   MemoryComplaints BehavioralProblems        ADL Confusion Disorientation
## 1                0                  0 1.72588346         0              0
## 2                0                  0 2.59242413         0              0
## 3                0                  0 7.11954774         0              1
## 4                0                  1 6.48122586         0              0
## 5                0                  0 0.01469122         0              0
## 6                0                  0 9.01568628         1              0
##   PersonalityChanges DifficultyCompletingTasks Forgetfulness Diagnosis
## 1                  0                         1             0         0
## 2                  0                         0             1         0
## 3                  0                         1             0         0
## 4                  0                         0             0         0
## 5                  1                         1             0         0
## 6                  0                         0             0         0
##   DoctorInCharge
## 1      XXXConfid
## 2      XXXConfid
## 3      XXXConfid
## 4      XXXConfid
## 5      XXXConfid
## 6      XXXConfid

Next, the summary() function is used to summarize the data frame and the characteristics of the variables.

summary(data)
##    PatientID         Age            Gender         Ethnicity     
##  Min.   :4751   Min.   :60.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:5288   1st Qu.:67.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :5825   Median :75.00   Median :1.0000   Median :0.0000  
##  Mean   :5825   Mean   :74.91   Mean   :0.5063   Mean   :0.6975  
##  3rd Qu.:6362   3rd Qu.:83.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :6899   Max.   :90.00   Max.   :1.0000   Max.   :3.0000  
##  EducationLevel       BMI           Smoking       AlcoholConsumption 
##  Min.   :0.000   Min.   :15.01   Min.   :0.0000   Min.   : 0.002003  
##  1st Qu.:1.000   1st Qu.:21.61   1st Qu.:0.0000   1st Qu.: 5.139810  
##  Median :1.000   Median :27.82   Median :0.0000   Median : 9.934412  
##  Mean   :1.287   Mean   :27.66   Mean   :0.2885   Mean   :10.039442  
##  3rd Qu.:2.000   3rd Qu.:33.87   3rd Qu.:1.0000   3rd Qu.:15.157931  
##  Max.   :3.000   Max.   :39.99   Max.   :1.0000   Max.   :19.989293  
##  PhysicalActivity    DietQuality        SleepQuality    FamilyHistoryAlzheimers
##  Min.   :0.003616   Min.   :0.009385   Min.   : 4.003   Min.   :0.0000         
##  1st Qu.:2.570626   1st Qu.:2.458455   1st Qu.: 5.483   1st Qu.:0.0000         
##  Median :4.766424   Median :5.076087   Median : 7.116   Median :0.0000         
##  Mean   :4.920202   Mean   :4.993138   Mean   : 7.051   Mean   :0.2522         
##  3rd Qu.:7.427899   3rd Qu.:7.558625   3rd Qu.: 8.563   3rd Qu.:1.0000         
##  Max.   :9.987429   Max.   :9.998346   Max.   :10.000   Max.   :1.0000         
##  CardiovascularDisease    Diabetes        Depression       HeadInjury    
##  Min.   :0.0000        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000        Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1443        Mean   :0.1508   Mean   :0.2006   Mean   :0.0926  
##  3rd Qu.:0.0000        3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000        Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Hypertension      SystolicBP     DiastolicBP     CholesterolTotal
##  Min.   :0.0000   Min.   : 90.0   Min.   : 60.00   Min.   :150.1   
##  1st Qu.:0.0000   1st Qu.:112.0   1st Qu.: 74.00   1st Qu.:190.3   
##  Median :0.0000   Median :134.0   Median : 91.00   Median :225.1   
##  Mean   :0.1489   Mean   :134.3   Mean   : 89.85   Mean   :225.2   
##  3rd Qu.:0.0000   3rd Qu.:157.0   3rd Qu.:105.00   3rd Qu.:262.0   
##  Max.   :1.0000   Max.   :179.0   Max.   :119.00   Max.   :300.0   
##  CholesterolLDL   CholesterolHDL  CholesterolTriglycerides      MMSE          
##  Min.   : 50.23   Min.   :20.00   Min.   : 50.41           Min.   : 0.005312  
##  1st Qu.: 87.20   1st Qu.:39.10   1st Qu.:137.58           1st Qu.: 7.167602  
##  Median :123.34   Median :59.77   Median :230.30           Median :14.441660  
##  Mean   :124.34   Mean   :59.46   Mean   :228.28           Mean   :14.755132  
##  3rd Qu.:161.73   3rd Qu.:78.94   3rd Qu.:314.84           3rd Qu.:22.161028  
##  Max.   :199.97   Max.   :99.98   Max.   :399.94           Max.   :29.991381  
##  FunctionalAssessment MemoryComplaints BehavioralProblems      ADL           
##  Min.   :0.00046      Min.   :0.000    Min.   :0.0000     Min.   : 0.001288  
##  1st Qu.:2.56628      1st Qu.:0.000    1st Qu.:0.0000     1st Qu.: 2.342836  
##  Median :5.09444      Median :0.000    Median :0.0000     Median : 5.038973  
##  Mean   :5.08005      Mean   :0.208    Mean   :0.1568     Mean   : 4.982958  
##  3rd Qu.:7.54698      3rd Qu.:0.000    3rd Qu.:0.0000     3rd Qu.: 7.581490  
##  Max.   :9.99647      Max.   :1.000    Max.   :1.0000     Max.   : 9.999747  
##    Confusion      Disorientation   PersonalityChanges DifficultyCompletingTasks
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     Min.   :0.0000           
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.0000           
##  Median :0.0000   Median :0.0000   Median :0.0000     Median :0.0000           
##  Mean   :0.2052   Mean   :0.1582   Mean   :0.1508     Mean   :0.1587           
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000     3rd Qu.:0.0000           
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     Max.   :1.0000           
##  Forgetfulness      Diagnosis      DoctorInCharge    
##  Min.   :0.0000   Min.   :0.0000   Length:2149       
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :0.3015   Mean   :0.3537                     
##  3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :1.0000   Max.   :1.0000
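
The column choices described earlier (setting PatientID aside, excluding DoctorInCharge, and treating Diagnosis as the target) can be sketched on a toy data frame as follows. Because Diagnosis is coded 0/1, its mean in the summary above (0.3537) is the share of observations coded 1. The toy values, the object names `patient_ids`, `model_data`, and `class_share`, and the "No"/"Yes" labels are assumptions made for illustration and are not part of the original analysis.

```r
# Toy rows mimicking a few columns of the dataset (illustrative values only).
toy <- data.frame(
  PatientID = c(4751, 4752, 4753, 4754),
  Age = c(73, 89, 73, 74),
  Diagnosis = c(0, 0, 1, 0),
  DoctorInCharge = rep("XXXConfid", 4)
)

# Keep the identifier aside and drop the non-analytic columns.
patient_ids <- toy$PatientID
model_data <- toy[, !(names(toy) %in% c("PatientID", "DoctorInCharge"))]

# Encode the target as a labelled factor (the label names are an assumption).
model_data$Diagnosis <- factor(model_data$Diagnosis,
                               levels = c(0, 1),
                               labels = c("No", "Yes"))

# Share of each class in the target variable.
class_share <- prop.table(table(model_data$Diagnosis))
print(class_share)
```

Converting the target to a factor makes classification functions treat it as a class label rather than as a number.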

Before conducting the statistical analysis, we need to clean and prepare the dataset. Namely, outliers need to be detected and adjusted accordingly, and missing values have to be handled. The data preprocessing stage is therefore essential for obtaining reliable results in the subsequent analysis and prediction.

4.1 Missing values

To check for missing values (NAs) in R, we plot them with the gg_miss_var() function:

gg_miss_var(data, show_pct = TRUE)

The plot of missing values reveals that the data frame does not contain any NAs. Thus, we can continue with the analysis.
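
A numerical check can complement the plot. The sketch below uses only base R on a toy data frame; the values are hypothetical (with two NAs added on purpose to show what a positive result looks like) and the names `toy`, `na_counts`, and `no_missing` are introduced here for illustration.

```r
# Toy data for illustration only (hypothetical values, not the Alzheimer's dataset).
toy <- data.frame(
  Age = c(73, 89, NA, 74),
  BMI = c(22.9, 26.8, 17.8, 33.8),
  Diagnosis = c(0, 0, 1, NA)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when the whole data frame is free of NA values.
no_missing <- !anyNA(toy)
print(no_missing)
```

Applied to the full data frame, a vector of zeros from colSums(is.na(data)) would agree with the conclusion drawn from the plot.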

4.2 Outlier detection

Outlier detection will be performed on the numerical variables, using a combination of visual and statistical techniques. Histograms provide a visual representation of each distribution, allowing unusually extreme values to be spotted. In addition, the interquartile range (IQR) method quantifies the spread of the data: with IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles, any observation below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as a potential outlier. Potential outliers will not be removed, as this can significantly alter the results of the analysis; instead, they will be replaced by random values. This approach is implemented in the following subsections.
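
Since the same IQR rule is applied to every numerical variable, it could also be wrapped in a small helper, sketched below. The function `is_iqr_outlier()`, its parameter `k`, and the toy data frame are hypothetical and are not part of the original analysis; the per-variable code in the subsections that follow remains the reference.

```r
# Hypothetical helper: flags values outside [Q1 - k * IQR, Q3 + k * IQR].
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy data for illustration only; these are not values from the Alzheimer's dataset.
toy <- data.frame(
  Age = c(60, 65, 70, 75, 80, 85),
  BMI = c(1, 2, 3, 4, 5, 100)
)

# Count the flagged values in each column in one pass.
outlier_counts <- sapply(toy, function(col) sum(is_iqr_outlier(col), na.rm = TRUE))
print(outlier_counts)

# Row indices of the flagged BMI values.
flagged_rows <- which(is_iqr_outlier(toy$BMI))
print(flagged_rows)
```

Returning a logical vector rather than a subset keeps the helper reusable: it can count, locate, or filter flagged rows depending on what the next step needs.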

4.2.1 Age

ggplot(data = data, aes(x = Age)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$Age, 0.25)
Q3 <- quantile(data$Age, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$Age < lower_bound | data$Age > upper_bound, ]
outliers
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Age variable.

4.2.2 BMI

ggplot(data = data, aes(x = BMI)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$BMI, 0.25)
Q3 <- quantile(data$BMI, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$BMI < lower_bound | data$BMI > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the BMI variable.

4.2.3 Alcohol Consumption

ggplot(data = data, aes(x = AlcoholConsumption)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$AlcoholConsumption, 0.25)
Q3 <- quantile(data$AlcoholConsumption, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$AlcoholConsumption < lower_bound | data$AlcoholConsumption > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Alcohol Consumption variable.

4.2.4 Physical Activity

ggplot(data = data, aes(x = PhysicalActivity)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$PhysicalActivity, 0.25)
Q3 <- quantile(data$PhysicalActivity, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$PhysicalActivity < lower_bound | data$PhysicalActivity > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Physical Activity variable.

4.2.5 Diet Quality

ggplot(data = data, aes(x = DietQuality)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$DietQuality, 0.25)
Q3 <- quantile(data$DietQuality, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$DietQuality < lower_bound | data$DietQuality > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Diet Quality variable.

4.2.6 Sleep Quality

ggplot(data = data, aes(x = SleepQuality)) +
  geom_histogram(bins = 30, color = "red", fill = "lightpink")

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
Q1 <- quantile(data$SleepQuality, 0.25)
Q3 <- quantile(data$SleepQuality, 0.75)
iqr <- Q3 - Q1  # lowercase name avoids masking stats::IQR()

lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

outliers <- data[data$SleepQuality < lower_bound | data$SleepQuality > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Sleep Quality variable.

4.2.7 SystolicBP

ggplot(data = data, aes(x = SystolicBP)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$SystolicBP, 0.25)
Q3 <- quantile(data$SystolicBP, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$SystolicBP < lower_bound | data$SystolicBP > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Systolic BP variable.

4.2.8 Diastolic BP

ggplot(data = data, aes(x = DiastolicBP)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$DiastolicBP, 0.25)
Q3 <- quantile(data$DiastolicBP, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$DiastolicBP < lower_bound | data$DiastolicBP > upper_bound, ]
outliers
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Diastolic BP variable.

4.2.9 CholesterolTotal

ggplot(data = data, aes(x = CholesterolTotal)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$CholesterolTotal, 0.25)
Q3 <- quantile(data$CholesterolTotal, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$CholesterolTotal < lower_bound | data$CholesterolTotal > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the CholesterolTotal variable.

4.2.10 CholesterolLDL

ggplot(data = data, aes(x = CholesterolLDL)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$CholesterolLDL, 0.25)
Q3 <- quantile(data$CholesterolLDL, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$CholesterolLDL < lower_bound | data$CholesterolLDL > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the CholesterolLDL variable.

4.2.11 CholesterolHDL

ggplot(data = data, aes(x = CholesterolHDL)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$CholesterolHDL, 0.25)
Q3 <- quantile(data$CholesterolHDL, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$CholesterolHDL < lower_bound | data$CholesterolHDL > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the CholesterolHDL variable.

4.2.12 Cholesterol Triglycerides

ggplot(data = data, aes(x = CholesterolTriglycerides)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$CholesterolTriglycerides, 0.25)
Q3 <- quantile(data$CholesterolTriglycerides, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$CholesterolTriglycerides < lower_bound | data$CholesterolTriglycerides > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the CholesterolTriglycerides variable.

4.2.13 MMSE

ggplot(data = data, aes(x = MMSE)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$MMSE, 0.25)
Q3 <- quantile(data$MMSE, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$MMSE < lower_bound | data$MMSE > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the MMSE variable.

4.2.14 Functional Assessment

ggplot(data = data, aes(x = FunctionalAssessment)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$FunctionalAssessment, 0.25)
Q3 <- quantile(data$FunctionalAssessment, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$FunctionalAssessment < lower_bound | data$FunctionalAssessment > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the Functional Assessment variable.

4.2.15 ADL

ggplot(data = data, aes(x = ADL)) +
     geom_histogram(bins = 30, color = "red", fill = "lightpink")

Q1 <- quantile(data$ADL, 0.25)
Q3 <- quantile(data$ADL, 0.75)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

outliers <- data[data$ADL < lower_bound | data$ADL > upper_bound, ]
outliers 
##  [1] PatientID                 Age                      
##  [3] Gender                    Ethnicity                
##  [5] EducationLevel            BMI                      
##  [7] Smoking                   AlcoholConsumption       
##  [9] PhysicalActivity          DietQuality              
## [11] SleepQuality              FamilyHistoryAlzheimers  
## [13] CardiovascularDisease     Diabetes                 
## [15] Depression                HeadInjury               
## [17] Hypertension              SystolicBP               
## [19] DiastolicBP               CholesterolTotal         
## [21] CholesterolLDL            CholesterolHDL           
## [23] CholesterolTriglycerides  MMSE                     
## [25] FunctionalAssessment      MemoryComplaints         
## [27] BehavioralProblems        ADL                      
## [29] Confusion                 Disorientation           
## [31] PersonalityChanges        DifficultyCompletingTasks
## [33] Forgetfulness             Diagnosis                
## [35] DoctorInCharge           
## <0 rows> (or 0-length row.names)

No outliers were detected in the ADL variable.

As observed, the outlier detection process, which involved inspecting histograms and boxplots along with calculating the IQR, did not reveal any significant outliers among the numerical variables. Therefore, no additional outlier handling is required.
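The per-variable IQR check repeated above can be condensed into one helper. The following is a minimal sketch, not the code used to produce the results in this report; the function names `count_iqr_outliers` and `count_outliers_by_column` and the toy data frame are hypothetical and exist only for illustration.

```r
# Count values outside the 1.5 * IQR fences for one numeric vector.
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Apply the rule to every column of `df` named in `cols`.
count_outliers_by_column <- function(df, cols) {
  vapply(df[cols], count_iqr_outliers, integer(1))
}

# Toy illustration with made-up values (not the study data):
# column `a` contains one extreme value, column `b` contains none.
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))
toy_counts <- count_outliers_by_column(toy, c("a", "b"))
toy_counts
```

On the report's data the same call would look like `count_outliers_by_column(data, c("DietQuality", "SleepQuality", "MMSE", "ADL"))`. A side benefit is that the helper does not create a variable named `IQR`, so it does not mask R's built-in `IQR()` function.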

5 Exploratory Data Analysis

To deepen our understanding of the Alzheimer’s disease diagnosis and the variables that may affect it, we use Exploratory Data Analysis (EDA) to examine the relevant relationships in detail. We first inspect the relationships between variables visually, using bar plots, histograms, box plots, and density plots. Later, we conduct hypothesis tests to validate the visual impressions. To better understand the variables that affect our target variable, we begin by examining the Diagnosis variable itself.

5.1 Investigation of the target variable: Diagnosis

As indicated by the str(data) function, the Diagnosis variable is stored as an integer even though it is a binary variable. To correct this, we convert the Diagnosis variable to a factor.

data$Diagnosis <- as.factor(data$Diagnosis)

First, we examine a simple numerical distribution of the data.

summary(data$Diagnosis)
##    0    1 
## 1389  760
760/(1389+760)
## [1] 0.3536529
ggplot(data = data) + 
    geom_bar(aes(x = Diagnosis), fill = c("pink", "lightblue")) +
    labs(title = "Bar plot for the target variable 'Diagnosis'")  

Based on the summary() output and the distribution of the Diagnosis variable, we see that about 35% of patients in our data (760 of 2149) have a diagnosis of Alzheimer’s disease.

We now explore the relationships between the Diagnosis variable and our potential predictors. For categorical predictors, we use a contingency table along with two types of bar plots: a standard stacked bar plot and one with equal-height bars, which allows proportions to be compared across categories. For numerical variables, we use a box plot together with a histogram (for discrete variables) or a density plot (for continuous variables).
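Because the same pair of bar plots is drawn for every categorical predictor, the pattern can be wrapped in a small function. This is only a sketch under the assumption that Diagnosis has already been converted to a factor; the name `plot_categorical` is hypothetical and is not used elsewhere in this report.

```r
library(ggplot2)

# Hypothetical helper: builds the stacked and the proportional bar plot
# for one categorical predictor against Diagnosis.
plot_categorical <- function(df, predictor, label = predictor) {
  # Shared colours and axis label for both plots
  base <- ggplot(data = df) +
    scale_fill_manual(values = c("pink", "lightblue")) +
    xlab(label)
  list(
    # Raw counts per category, split by diagnosis
    counts = base +
      geom_bar(aes(x = as.factor(.data[[predictor]]), fill = Diagnosis)),
    # Equal-height bars showing the share of each diagnosis per category
    proportions = base +
      geom_bar(aes(x = as.factor(.data[[predictor]]), fill = Diagnosis),
               position = "fill")
  )
}
```

A call such as `plots <- plot_categorical(data, "Diabetes")` followed by `plots$counts` and `plots$proportions` would reproduce the two plots of a subsection. The sections below keep the explicit code so that each chunk can be read on its own.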

5.2 Investigation of Medical History variables

5.2.1 Family history: Alzheimer

addmargins(table(data$Diagnosis, data$FamilyHistoryAlzheimers, dnn = c("Diagnosis of Alzheimer", "Family History: Alzheimers")))
##                       Family History: Alzheimers
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1024  365 1389
##                    1    583  177  760
##                    Sum 1607  542 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(FamilyHistoryAlzheimers), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Family History Alzheimer's")


ggplot(data = data) + 
  geom_bar(aes(x = as.factor(FamilyHistoryAlzheimers), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Family History Alzheimer's")

People without a family history of Alzheimer’s appear slightly more likely to have Alzheimer’s (583 of 1607, about 36%) than people with a family history (177 of 542, about 33%).
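The proportions shown by the equal-height bar plot can also be read off numerically. The sketch below rebuilds the contingency table from the counts printed above (the object name `fam` is an illustrative choice) and normalizes each column with `prop.table()`.

```r
# Counts copied from the contingency table above
# (rows: Diagnosis, columns: FamilyHistoryAlzheimers).
fam <- matrix(c(1024, 583, 365, 177), nrow = 2,
              dimnames = list(Diagnosis = c("0", "1"),
                              FamilyHistory = c("0", "1")))

# Column-wise proportions: share of each diagnosis within each
# family-history group, which is what position = "fill" displays.
fam_prop <- prop.table(fam, margin = 2)
round(100 * fam_prop, 1)
```

On the full data frame the equivalent call would be `prop.table(table(data$Diagnosis, data$FamilyHistoryAlzheimers), margin = 2)`.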

5.2.2 Cardiovascular disease

addmargins(table(data$Diagnosis, data$CardiovascularDisease, dnn = c("Diagnosis of Alzheimer", "Cardiovascular Disease")))
##                       Cardiovascular Disease
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1200  189 1389
##                    1    639  121  760
##                    Sum 1839  310 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(CardiovascularDisease), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) + 
  xlab("Cardiovascular Disease")

ggplot(data = data) + 
  geom_bar(aes(x = as.factor(CardiovascularDisease), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Cardiovascular Disease")

Based on the standardized bar plot, people who have cardiovascular disease seem slightly more likely to have Alzheimer’s (121 of 310, about 39%) than people who do not (639 of 1839, about 35%).

5.2.3 Diabetes

addmargins(table(data$Diagnosis, data$Diabetes, dnn = c("Diagnosis of Alzheimer", "Diabetes")))
##                       Diabetes
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1168  221 1389
##                    1    657  103  760
##                    Sum 1825  324 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Diabetes), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Diabetes")

ggplot(data = data) + 
  geom_bar(aes(x = as.factor(Diabetes), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Diabetes")

People without diabetes are slightly more likely to have Alzheimer’s (657 of 1825, about 36%) than people with diabetes (103 of 324, about 32%).

5.2.4 Depression

addmargins(table(data$Diagnosis, data$Depression, dnn = c("Diagnosis of Alzheimer", "Depression")))
##                       Depression
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1108  281 1389
##                    1    610  150  760
##                    Sum 1718  431 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Depression), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Depression")

ggplot(data = data) + 
  geom_bar(aes(x = as.factor(Depression), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Depression")

Depression doesn’t seem to be indicative of whether someone has Alzheimer’s: the diagnosis rate is close to 35% both with depression (150 of 431) and without it (610 of 1718).

5.2.5 Head Injury

addmargins(table(data$Diagnosis, data$HeadInjury, dnn = c("Diagnosis of Alzheimer", "Head Injury")))
##                       Head Injury
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1254  135 1389
##                    1    696   64  760
##                    Sum 1950  199 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(HeadInjury), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Head Injury")


ggplot(data = data) + 
  geom_bar(aes(x = as.factor(HeadInjury), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Head Injury")

People without a head injury are slightly more likely to be diagnosed with Alzheimer’s (696 of 1950, about 36%) than people with a head injury (64 of 199, about 32%).

5.2.6 Hypertension

addmargins(table(data$Diagnosis, data$Hypertension, dnn = c("Diagnosis of Alzheimer", "Hypertension")))
##                       Hypertension
## Diagnosis of Alzheimer    0    1  Sum
##                    0   1195  194 1389
##                    1    634  126  760
##                    Sum 1829  320 2149
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Hypertension), fill = Diagnosis)) +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Hypertension")

ggplot(data = data) + 
  geom_bar(aes(x = as.factor(Hypertension), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Hypertension")

People who have hypertension are somewhat more likely to be diagnosed with Alzheimer’s (126 of 320, about 39%) than people without hypertension (634 of 1829, about 35%).

5.3 Investigation of Clinical Measurements variables

5.3.1 Systolic BP

ggplot(data = data) +
  geom_bar(aes(x = SystolicBP)) 

ggplot(data = data) +
  geom_bar(aes(x = SystolicBP, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_bar(aes(x = SystolicBP, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = SystolicBP, fill = Diagnosis), alpha = 0.3)

The density plot suggests that people with a systolic BP in the range of roughly 110 to 140 might be more likely to have Alzheimer’s.

5.3.2 Diastolic BP

ggplot(data = data) +
  geom_bar(aes(x = DiastolicBP)) 

ggplot(data = data) +
  geom_bar(aes(x = DiastolicBP, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("DiastolicBP")

ggplot(data = data) +
  geom_bar(aes(x = DiastolicBP, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("DiastolicBP")

ggplot(data = data) +
  geom_density(aes(x = DiastolicBP, fill = Diagnosis), alpha = 0.3)

The density curves for both groups largely overlap, indicating that diastolic BP levels are quite similar between individuals with and without Alzheimer’s. The curve for people diagnosed with Alzheimer’s suggests that people with a lower diastolic BP (roughly 63 to 75) might be slightly more likely to be diagnosed.

5.3.3 Cholesterol Total

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = CholesterolTotal), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = CholesterolTotal, fill = Diagnosis), alpha = 0.3)

The median total cholesterol is essentially the same in both groups, as seen in the box plot. The density curves for both groups largely overlap, indicating that cholesterol levels are quite similar between individuals with and without Alzheimer’s. Total cholesterol doesn’t seem to be a predictor of Alzheimer’s.

5.3.4 Cholesterol LDL

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = CholesterolLDL), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = CholesterolLDL, fill = Diagnosis), alpha = 0.3)

The medians of cholesterol LDL are very similar, as seen in the box plot. The density curves for both groups largely overlap, indicating that cholesterol LDL levels are quite similar between individuals with and without Alzheimer’s. There is a slight peak in the Alzheimer’s group, which may suggest that people with a cholesterol LDL of 50 to 90 are more likely to be diagnosed with Alzheimer’s.

5.3.5 Cholesterol HDL

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = CholesterolHDL), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = CholesterolHDL, fill = Diagnosis), alpha = 0.3)

The box plot and density plot suggest that people diagnosed with Alzheimer’s are more likely to have higher levels of cholesterol HDL. Thus, cholesterol HDL might be a significant predictor of Alzheimer’s.

5.3.6 Cholesterol Triglycerides

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = CholesterolTriglycerides), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = CholesterolTriglycerides, fill = Diagnosis), alpha = 0.3)

The box plot suggests that people with Alzheimer’s have a higher median level of cholesterol triglycerides than people without Alzheimer’s; however, the upper and lower quartiles are nearly the same. The density plot suggests that people with cholesterol triglycerides levels of roughly 230 to 310 might be more likely to have Alzheimer’s.

5.4 Investigation of Cognitive and Functional Assessments variables

5.4.1 MMSE

ggplot(data = data) +
    geom_histogram(mapping = aes(x = MMSE), binwidth = 0.3)

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = MMSE), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = MMSE, fill = Diagnosis), alpha = 0.3)

The MMSE scores are spread widely across their range, and the overall histogram shows no clear trend. Split by diagnosis, however, people with Alzheimer’s tend to have lower MMSE scores: the median, 25th percentile, and 75th percentile are all lower than for their non-Alzheimer’s counterparts. People without Alzheimer’s are more likely to have a higher MMSE (> 23), while people with Alzheimer’s are more likely to have a lower MMSE. Thus, MMSE might be a useful predictor.
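The quartile comparison read off the box plot can be checked numerically by computing the quartiles within each diagnosis group. The sketch below is illustrative only: `quartiles_by_diagnosis` is a hypothetical helper, and the toy data frame contains made-up scores rather than the study data.

```r
# Quartiles of a numeric variable within each diagnosis group.
# Returns a matrix with one row per quartile and one column per group.
quartiles_by_diagnosis <- function(df, variable) {
  sapply(split(df[[variable]], df$Diagnosis),
         quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
}

# Toy illustration with made-up scores (not the study data)
toy <- data.frame(
  Diagnosis = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  MMSE = c(20, 22, 24, 26, 10, 12, 14, 16)
)
toy_quartiles <- quartiles_by_diagnosis(toy, "MMSE")
toy_quartiles
```

Applied to the report's data as `quartiles_by_diagnosis(data, "MMSE")`, the result would show directly whether all three quartiles are lower in the group with Diagnosis equal to 1.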

5.4.2 Functional Assessment

ggplot(data = data) +
    geom_histogram(mapping = aes(x = FunctionalAssessment), binwidth = 0.2)

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = FunctionalAssessment), fill = c("pink", "lightblue")) 

ggplot(data = data) +
  geom_density(aes(x = FunctionalAssessment, fill = Diagnosis), alpha = 0.3)

The box plot indicates that functional assessment is a very useful variable, given that the scores are noticeably lower in those with Alzheimer’s than in those without. The density plot shows a very clear trend: people who score low on Functional Assessment are much more likely to have Alzheimer’s, and people who score high are less likely to have it. Thus, Functional Assessment might be an important predictor of Alzheimer’s.

5.4.3 Memory Complaints

table(data$MemoryComplaints, data$Diagnosis, dnn = c("Memory", "Diagnosis"))
##       Diagnosis
## Memory    0    1
##      0 1228  474
##      1  161  286
ggplot(data = data) +
  geom_bar(aes(x = as.factor(MemoryComplaints), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Memory Complaints")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(MemoryComplaints), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Memory Complaints")

Only 286 of 760, or roughly 38%, of diagnosed Alzheimer’s patients complained about memory. This is still far higher than the 12% of people (161 of 1389) who were not diagnosed but complained about memory. The bar plots demonstrate that memory complaints are a solid indicator: patients with complaints have a much higher chance of an Alzheimer’s diagnosis than those without.
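The two ways of reading this table answer different questions: the share of each diagnosis group that reports complaints, and the share of each complaint group that is diagnosed. A short sketch using the counts printed above (the object name `mem` is an illustrative choice) makes both explicit.

```r
# Counts copied from the table above
# (rows: MemoryComplaints, columns: Diagnosis).
mem <- matrix(c(1228, 161, 474, 286), nrow = 2,
              dimnames = list(Memory = c("0", "1"),
                              Diagnosis = c("0", "1")))

# Column-wise: share of each diagnosis group with / without complaints
by_diagnosis <- prop.table(mem, margin = 2)
round(100 * by_diagnosis, 1)

# Row-wise: share diagnosed among patients with / without complaints,
# which is what the position = "fill" bar plot displays
by_complaint <- prop.table(mem, margin = 1)
round(100 * by_complaint, 1)
```

Read row-wise, about 64% of patients with memory complaints (286 of 447) are diagnosed, compared with about 28% of patients without complaints (474 of 1702).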

5.4.4 Behavioral Problems

table(data$BehavioralProblems, data$Diagnosis, dnn = c("Behavioral Problems", "Diagnosis"))
##                    Diagnosis
## Behavioral Problems    0    1
##                   0 1255  557
##                   1  134  203
ggplot(data = data) +
  geom_bar(aes(x = as.factor(BehavioralProblems)))

ggplot(data = data) +
  geom_bar(aes(x = as.factor(BehavioralProblems), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Behavioral Problems")

Patients without behavioral problems far outnumber those with them (1812 versus 337). Behavioral problems nevertheless serve as a solid indicator of Alzheimer’s, as people with behavioral problems were more likely to have a diagnosis (203 of 337, about 60%) than people without them (557 of 1812, about 31%).

5.4.5 ADL

ggplot(data = data) +
    geom_histogram(mapping = aes(x = ADL), binwidth = 0.2)

ggplot(data = data) +
  geom_density(aes(x = ADL, fill = Diagnosis), alpha = 0.3)

The ADL scores appear to be a little more concentrated around the tails. ADL seems to be a very good indicator of Alzheimer’s because low scores are more common among patients diagnosed with Alzheimer’s.

5.5 Investigation of Symptoms variables

5.5.1 Confusion

table(data$Confusion, data$Diagnosis, dnn = c("Confusion", "Diagnosis"))
##          Diagnosis
## Confusion    0    1
##         0 1096  612
##         1  293  148
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Confusion)))

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Confusion), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Confusion")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Confusion), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Confusion")

Confusion serves as a poor indicator of Alzheimer’s, as the rates of confusion are very similar between those who were diagnosed (148 of 760, about 19%) and those who were not (293 of 1389, about 21%).

5.5.2 Disorientation

table(data$Disorientation, data$Diagnosis, dnn = c("Disorientation", "Diagnosis"))
##               Diagnosis
## Disorientation    0    1
##              0 1160  649
##              1  229  111
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Disorientation))) 

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Disorientation), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Disorientation")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Disorientation), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Disorientation")

Disorientation seems to be quite uncommon, with the vast majority of patients not reporting it. Disorientation is another variable that appears to be poor at predicting Alzheimer’s, given that the rate of disorientation is similar in the Alzheimer’s group (111 of 760, about 15%) and the non-Alzheimer’s group (229 of 1389, about 16%).

5.5.3 Personality Changes

table(data$PersonalityChanges, data$Diagnosis, dnn = c("Personality Changes", "Diagnosis"))
##                    Diagnosis
## Personality Changes    0    1
##                   0 1172  653
##                   1  217  107
ggplot(data = data) +
  geom_bar(aes(x = as.factor(PersonalityChanges)))

ggplot(data = data) +
  geom_bar(aes(x = as.factor(PersonalityChanges), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Personality Changes")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(PersonalityChanges), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Personality Changes")

In general, reported personality changes are uncommon. The rates of personality changes are very similar between the two diagnosis groups (107 of 760, about 14%, versus 217 of 1389, about 16%), so personality changes are a rather poor indicator of Alzheimer’s.

5.5.4 Difficulty Completing Tasks

table(data$DifficultyCompletingTasks, data$Diagnosis, dnn = c("Difficulty Completing Tasks", "Diagnosis"))
##                            Diagnosis
## Difficulty Completing Tasks    0    1
##                           0 1172  636
##                           1  217  124
ggplot(data = data) +
  geom_bar(aes(x = as.factor(DifficultyCompletingTasks)))

ggplot(data = data) +
  geom_bar(aes(x = as.factor(DifficultyCompletingTasks), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Difficulty Completing Tasks")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(DifficultyCompletingTasks), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Difficulty Completing Tasks")

Difficulty completing tasks was not very common in this data set, and no clear trend is visible. It does not appear to be a good indicator of an Alzheimer’s diagnosis, as the rates are similar in both diagnosis groups (15.6% without versus 16.3% with Alzheimer’s).

5.5.5 Forgetfulness

table(data$Forgetfulness, data$Diagnosis, dnn = c("Forgetfulness", "Diagnosis"))
##              Diagnosis
## Forgetfulness   0   1
##             0 970 531
##             1 419 229
ggplot(data = data) +
  geom_bar(aes(x = as.factor(Forgetfulness)))

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Forgetfulness), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Forgetfulness")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Forgetfulness), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Forgetfulness")

Forgetfulness is more common in both groups than the previous symptoms. The rates of forgetfulness are almost exactly the same between those without and with Alzheimer’s (30.2% versus 30.1%). This means very little can be determined about an Alzheimer’s diagnosis from that symptom alone.

5.6 Investigation of Demographic details variables

5.6.1 Age

ggplot(data = data) +
  geom_bar(aes(x = Age, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Age")

ggplot(data = data) +
  geom_bar(aes(x = Age, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Age")

Age, in this dataset, does not appear to be a strong predictor of Alzheimer’s disease.

5.6.2 Gender

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Gender), fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Gender")

ggplot(data = data) +
  geom_bar(aes(x = as.factor(Gender), fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Gender")

There is no visible gender difference in Alzheimer’s diagnosis.

5.6.3 Ethnicity

data$EthnicityFactor <- factor(data$Ethnicity, 
                                     levels = c(0, 1, 2, 3), 
                                     labels = c("Caucasian", "African American", "Asian","Other"))
ggplot(data = data) +
  geom_bar(aes(x = EthnicityFactor, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Ethnicity")

ggplot(data = data) +
  geom_bar(aes(x = EthnicityFactor, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Ethnicity")

Asian patients seem to be slightly more likely to be diagnosed with Alzheimer’s.

5.6.4 Education

data$EducationLevelFactor <- factor(data$EducationLevel, 
                                     levels = c(0, 1, 2, 3), 
                                     labels = c("None", "High School", "Bachelor's","Higher"))
ggplot(data = data) +
  geom_bar(aes(x = EducationLevelFactor, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Education")

ggplot(data = data) +
  geom_bar(aes(x = EducationLevelFactor, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Education")

Individuals with higher levels of education appear less likely to be diagnosed with Alzheimer’s. Thus, education level seems to be a potentially important predictor.

5.7 Investigation of Lifestyle factors variables

5.7.1 BMI

ggplot(data = data) +
  geom_density(aes(x = BMI, fill = Diagnosis), alpha = 0.3) 

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis, y = BMI), fill = c("pink", "lightblue")) 

The boxplot shows that individuals diagnosed with Alzheimer’s have a slightly higher BMI. The density plot reveals that individuals with a BMI larger than 33 are more likely to have Alzheimer’s.

5.7.2 Smoking

data$SmokingFactor <- factor(data$Smoking, 
                                     levels = c(0, 1), 
                                     labels = c("Non-Smoker", "Smoker"))
ggplot(data = data) +
  geom_bar(aes(x = SmokingFactor, fill = Diagnosis), position = "stack") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Smoking")

ggplot(data = data) +
  geom_bar(aes(x = SmokingFactor, fill = Diagnosis), position = "fill") +
  scale_fill_manual(values = c("pink", "lightblue")) +
  xlab("Smoking")

Smoking doesn’t seem to be an important predictor of Alzheimer’s.

5.7.3 Alcohol Consumption

ggplot(data = data) +
  geom_density(aes(x = AlcoholConsumption, fill = Diagnosis), alpha = 0.3) 

ggplot(data = data) +
  geom_boxplot(aes(x = Diagnosis , y = AlcoholConsumption), 
               fill = c("pink", "lightblue")) 

The median seems to be the same for people with and without Alzheimer’s; thus, alcohol consumption doesn’t seem to be an important predictor of Alzheimer’s.

5.7.4 Physical Activity

ggplot(data = data) +
  geom_density(aes(x = PhysicalActivity, fill = Diagnosis), alpha = 0.3) 

ggplot(data = data) +
  geom_boxplot(aes(x = PhysicalActivity, y = Diagnosis), 
               fill = c("pink", "lightblue")) 

The boxplot and density plot suggest that physical activity is not a strong predictor of Alzheimer’s, as the density curves overlap heavily and the median, first quartile, and third quartile are very similar.

5.7.5 Diet Quality

ggplot(data = data) +
  geom_density(aes(x = DietQuality, fill = Diagnosis), alpha = 0.3)

ggplot(data = data) +
  geom_boxplot(aes(x = DietQuality, y = Diagnosis), fill = c("pink", "lightblue")) 

The density plot shows that there is a lot of overlap between both groups, suggesting that diet quality is not a significant predictor.

5.7.6 Sleep Quality

ggplot(data = data) +
  geom_density(aes(x = SleepQuality, fill = Diagnosis), alpha = 0.3) 

ggplot(data = data) +
  geom_boxplot(aes(x = SleepQuality, y = Diagnosis), 
               fill = c("pink", "lightblue")) 

The boxplot suggests that individuals without Alzheimer’s have better sleep quality, and the density plot confirms this. Individuals with Alzheimer’s may therefore be more likely to have worse sleep quality, which makes sleep quality a potentially good predictor of Alzheimer’s.

5.8 Conclusions from Graphical Visualization

We analyzed six major areas of health information, each with several variables (32 in total): demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, and symptoms. Diagnosis served as the target variable against which we compared all of these variables, in order to determine which of them could best predict Alzheimer’s. Based on the EDA, we consider the variables below to be potential predictors of Alzheimer’s disease.

  • Ethnicity - Asians appear slightly more likely than other ethnicities to be diagnosed with Alzheimer’s.
  • EducationLevel - The rate of Alzheimer’s decreases as the level of education increases, but the differences are not vast.
  • BMI - Individuals with Alzheimer’s have slightly higher BMIs than those without Alzheimer’s. A BMI above 32.5 appears to demonstrate the greatest difference between Alzheimer’s diagnoses, but overall, the difference is not very significant.
  • FamilyHistoryAlzheimers - The data shows that people without a family history of Alzheimer’s had a slightly higher likelihood of being diagnosed with it.
  • CardiovascularDisease - Those with cardiovascular disease had a slightly higher rate of Alzheimer’s.
  • Diabetes - Diabetics appear slightly less likely to have Alzheimer’s, but the difference is not large.
  • HeadInjury - Those with head injuries were slightly less likely to have Alzheimer’s, but again, this difference is not very large.
  • Hypertension - People with hypertension were slightly more likely to have an Alzheimer’s diagnosis than those without hypertension.
  • SystolicBP - The density plot suggests people with a systolic BP between 110-140 may be slightly more likely to have an Alzheimer’s diagnosis.
  • DiastolicBP - The density curves mirror each other very closely, but they also indicate that people with a diastolic BP of 63 to 75 are slightly more likely to have Alzheimer’s.
  • CholesterolHDL - The box plot and density plot indicate that people with higher levels of Cholesterol HDL are more likely to have Alzheimer’s and vice-versa.
  • CholesterolTriglycerides - People with Alzheimer’s have a higher median for cholesterol triglycerides, indicating that it may be a suitable predicting factor.
  • MMSE - People without Alzheimer’s are more likely to have a higher MMSE score (> 23), while people with Alzheimer’s are more likely to have a lower score, making MMSE a very suitable predictor of Alzheimer’s.
  • FunctionalAssessment - The functional assessment scores of those with Alzheimer’s are much lower compared to people who have not been diagnosed, making it a good predictor of Alzheimer’s.
  • MemoryComplaints - Only 12% of non-Alzheimer patients complained of memory problems, which is in stark contrast to the 38% of Alzheimer’s patients who complained, meaning there is a very significant difference between the two.
  • BehavioralProblems - People with behavioral problems were far more likely to have Alzheimer’s than people without behavioral problems, making it a suitable indicator.
  • ADL - ADL scores are very low in people with Alzheimer’s and much higher in those without, making it a very accurate predictor of Alzheimer’s.
  • SleepQuality - Individuals with Alzheimer’s are more likely to have worse sleep quality.

5.9 Hypothesis testing

To validate the predictors identified through graphical inspection, we conduct hypothesis tests on these variables.

5.10 Ethnicity

chisq.test(table(data$Diagnosis, data$Ethnicity))
## 
##  Pearson's Chi-squared test
## 
## data:  table(data$Diagnosis, data$Ethnicity)
## X-squared = 6.3021, df = 3, p-value = 0.0978

The p-value (0.0978) is greater than the alpha level (0.05), so we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses among the different ethnic groups.

5.11 Education

chisq.test(table(data$Diagnosis, data$EducationLevel))
## 
##  Pearson's Chi-squared test
## 
## data:  table(data$Diagnosis, data$EducationLevel)
## X-squared = 4.4531, df = 3, p-value = 0.2165

The p-value (0.2165) is greater than the alpha level (0.05), so we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses among the different education levels.

5.12 BMI

t.test(BMI ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  BMI by Diagnosis
## t = -1.2148, df = 1537.9, p-value = 0.2246
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.0395624  0.2444058
## sample estimates:
## mean in group 0 mean in group 1 
##        27.51509        27.91267

The p-value (0.2246) is greater than the alpha level (0.05), so we do not reject the null hypothesis. There is no significant difference in mean BMI between people diagnosed with Alzheimer’s and people without Alzheimer’s.
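As a consistency check on the output above, the standard error and the confidence interval can be recovered from the reported means, t statistic, and degrees of freedom. This is a minimal sketch that only reuses numbers printed by t.test(); the variable names are illustrative, and small discrepancies are expected because the printed t value is rounded.

```r
# Values copied from the Welch t-test output for BMI
mean_0 <- 27.51509   # mean BMI, Diagnosis = 0
mean_1 <- 27.91267   # mean BMI, Diagnosis = 1
t_stat <- -1.2148
df     <- 1537.9

diff_means <- mean_0 - mean_1
# t = difference / standard error, so the standard error follows directly
std_error <- diff_means / t_stat
# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df) * std_error
ci
```

Because the interval contains zero, it leads to the same conclusion as the p-value: no significant difference at the 0.05 level.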

5.13 Family history of Alzheimer’s

prop.test(table(data$FamilyHistoryAlzheimers, data$Diagnosis))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(data$FamilyHistoryAlzheimers, data$Diagnosis)
## X-squared = 2.1703, df = 1, p-value = 0.1407
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.08340223  0.01096316
## sample estimates:
##    prop 1    prop 2 
## 0.6372122 0.6734317

The p-value (0.1407) is greater than the alpha level (0.05), so we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses between people with and without a family history of Alzheimer’s.

5.14 Cardiovascular disease

prop.test(table(data$CardiovascularDisease, data$Diagnosis))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(data$CardiovascularDisease, data$Diagnosis)
## X-squared = 1.9477, df = 1, p-value = 0.1628
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01753589  0.10323815
## sample estimates:
##    prop 1    prop 2 
## 0.6525285 0.6096774

The p-value (0.1628) is greater than the alpha level (0.05), so we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses between people with and without cardiovascular disease.

5.15 Diabetes

prop.test(table(data$Diabetes, data$Diagnosis))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(data$Diabetes, data$Diagnosis)
## X-squared = 1.9532, df = 1, p-value = 0.1622
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.09919617  0.01499864
## sample estimates:
##    prop 1    prop 2 
## 0.6400000 0.6820988

The p-value (0.1622) is greater than the alpha level (0.05), so we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses between people with and without diabetes.

5.16 HeadInjury

prop.test(table( data$HeadInjury, data$Diagnosis))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(data$HeadInjury, data$Diagnosis)
## X-squared = 0.83677, df = 1, p-value = 0.3603
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.10637604  0.03574597
## sample estimates:
##    prop 1    prop 2 
## 0.6430769 0.6783920

Since the p-value (0.3603) is not smaller than 0.05, we do not have enough evidence to reject the null hypothesis. There is no significant difference in the proportion of Alzheimer’s diagnoses between people with and without a head injury.

5.17 Hypertension

 prop.test(table(data$Hypertension, data$Diagnosis))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  table(data$Hypertension, data$Diagnosis)
## X-squared = 2.4425, df = 1, p-value = 0.1181
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01252733  0.10675232
## sample estimates:
##    prop 1    prop 2 
## 0.6533625 0.6062500

Since the p-value (0.1181) is not smaller than 0.05, we do not reject H0. There is no significant difference in the proportion of Alzheimer’s diagnoses between people with and without hypertension.

5.18 SystolicBP

t.test(SystolicBP ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  SystolicBP by Diagnosis
## t = 0.7235, df = 1560.4, p-value = 0.4695
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.449881  3.144540
## sample estimates:
## mean in group 0 mean in group 1 
##        134.5644        133.7171

Since the p-value (0.4695) is larger than 0.05, we do not have enough evidence that there is a significant difference, so we do not reject H0. There is no significant difference in mean systolic BP between people who have Alzheimer’s and people who don’t.

5.19 DiastolicBP

t.test(DiastolicBP ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  DiastolicBP by Diagnosis
## t = -0.24612, df = 1577.4, p-value = 0.8056
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.746453  1.357040
## sample estimates:
## mean in group 0 mean in group 1 
##        89.77898        89.97368

Since the p-value (0.8056) is larger than 0.05, we do not have enough evidence that there is a significant difference. There is no significant difference in mean diastolic BP between people who have Alzheimer’s and people who don’t.

5.20 CholesterolHDL

t.test(CholesterolHDL ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  CholesterolHDL by Diagnosis
## t = -1.9706, df = 1551.2, p-value = 0.04895
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -4.111498327 -0.009497238
## sample estimates:
## mean in group 0 mean in group 1 
##        58.73483        60.79533

Since the p-value (0.04895) is less than 0.05, we reject H0. The difference in mean CholesterolHDL between the two groups is statistically significant, although only marginally so.

5.21 CholesterolTriglycerides

t.test(CholesterolTriglycerides ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  CholesterolTriglycerides by Diagnosis
## t = -1.0502, df = 1558.6, p-value = 0.2938
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -13.866075   4.195806
## sample estimates:
## mean in group 0 mean in group 1 
##        226.5715        231.4067

Since the p-value (0.2938) is larger than 0.05, we do not reject H0. Thus, the difference in mean CholesterolTriglycerides between the two groups is not statistically significant.

5.22 MMSE

t.test(MMSE ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  MMSE by Diagnosis
## t = 12.025, df = 1851.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  3.574302 4.967469
## sample estimates:
## mean in group 0 mean in group 1 
##        16.26554        11.99466

MMSE scores were lower in patients with Alzheimer’s (mean 11.99 versus 16.27), and the difference is statistically significant (p-value < 2.2e-16). Thus, MMSE might be a useful predictor of Alzheimer’s.

5.23 FunctionalAssessment

t.test(FunctionalAssessment ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  FunctionalAssessment by Diagnosis
## t = 18.552, df = 1660.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  1.973921 2.440657
## sample estimates:
## mean in group 0 mean in group 1 
##        5.860669        3.653380

Functional assessment scores were lower in patients with Alzheimer’s (mean 3.65 versus 5.86), and the difference is also statistically significant (p-value < 2.2e-16). Thus, functional assessment might be a useful predictor of Alzheimer’s.

5.24 Memory Complaints

t.test(MemoryComplaints ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  MemoryComplaints by Diagnosis
## t = -13.305, df = 1129.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.2988062 -0.2220039
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1159107       0.3763158

Memory complaints were more frequent in patients with Alzheimer’s (37.6% versus 11.6%), and the difference is statistically significant (p-value < 2.2e-16). Thus, memory complaints might be a useful predictor of Alzheimer’s.
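Because MemoryComplaints is a binary variable, a two-sample test of proportions is a natural companion to the t-test above. The sketch below assumes group sizes of 1,389 (Diagnosis = 0) and 760 (Diagnosis = 1), taken from the symptom tables earlier in the report, and reconstructs the complaint counts from the reported group means. These counts are therefore an assumption, not values read directly from the data.

```r
n <- c(1389, 760)                       # group sizes for Diagnosis = 0 and 1
group_means <- c(0.1159107, 0.3763158)  # reported share with memory complaints
x <- round(group_means * n)             # reconstructed counts of complaints

res <- prop.test(x = x, n = n)
res$estimate   # sample proportions per group
res$p.value    # compare against alpha = 0.05
```

On the real data, the equivalent call would follow the pattern already used for the medical history variables, for example prop.test(table(data$MemoryComplaints, data$Diagnosis)).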

5.25 BehavioralProblems

t.test(BehavioralProblems ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  BehavioralProblems by Diagnosis
## t = -9.528, df = 1136.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.2057706 -0.1354954
## sample estimates:
## mean in group 0 mean in group 1 
##      0.09647228      0.26710526

Behavioral problems were more common in patients with Alzheimer’s (26.7% versus 9.6%), and the difference is statistically significant (p-value < 2.2e-16). Thus, behavioral problems might be useful in predicting an Alzheimer’s diagnosis.

5.26 ADL

t.test(ADL ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  ADL by Diagnosis
## t = 16.546, df = 1622.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  1.807000 2.293026
## sample estimates:
## mean in group 0 mean in group 1 
##        5.707951        3.657938

ADL scores were lower in patients with Alzheimer’s (mean 3.66 versus 5.71), and the difference is statistically significant (p-value < 2.2e-16). ADL might be a useful predictor of an Alzheimer’s diagnosis.

5.27 Sleep Quality

t.test(SleepQuality ~ Diagnosis, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  SleepQuality by Diagnosis
## t = 2.6282, df = 1567.7, p-value = 0.008669
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.05289963 0.36417980
## sample estimates:
## mean in group 0 mean in group 1 
##        7.124832        6.916292

Sleep quality was lower in patients with Alzheimer’s, and the difference is statistically significant: the p-value of 0.008669 is below the alpha level (0.05). Sleep quality might also be a useful predictor of an Alzheimer’s diagnosis.

5.28 Conclusion

Variables that seem to be useful predictors of Alzheimer’s based on hypothesis testing:

  • Cholesterol HDL
  • MMSE
  • Functional Assessment
  • Memory Complaints
  • Behavioral Problems
  • ADL
  • Sleep Quality
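Since eighteen tests were run at alpha = 0.05, some small p-values could arise by chance. A minimal sketch of a multiple-comparison adjustment follows, using base R's p.adjust() with the Holm method on the p-values reported above. This is an illustrative sensitivity check, not part of the original analysis; values reported as < 2.2e-16 are entered as 2.2e-16, which is an upper bound.

```r
# p-values copied from the test outputs in Sections 5.10 to 5.27
p_values <- c(
  Ethnicity = 0.0978, EducationLevel = 0.2165, BMI = 0.2246,
  FamilyHistoryAlzheimers = 0.1407, CardiovascularDisease = 0.1628,
  Diabetes = 0.1622, HeadInjury = 0.3603, Hypertension = 0.1181,
  SystolicBP = 0.4695, DiastolicBP = 0.8056, CholesterolHDL = 0.04895,
  CholesterolTriglycerides = 0.2938, MMSE = 2.2e-16,
  FunctionalAssessment = 2.2e-16, MemoryComplaints = 2.2e-16,
  BehavioralProblems = 2.2e-16, ADL = 2.2e-16, SleepQuality = 0.008669
)

# Holm adjustment controls the family-wise error rate
adjusted <- p.adjust(p_values, method = "holm")

# Variables that remain below 0.05 after adjustment
names(adjusted)[adjusted < 0.05]
```

Under this adjustment, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems and ADL stay below 0.05, while CholesterolHDL and SleepQuality do not, so the evidence for those two variables should be read as weaker.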

6 Data Preparation

In this stage, the dataset is prepared for modeling. The dataset is partitioned randomly into two groups: a train set (80%) and a test set (20%). The partition() function from the liver package is used, with a random seed set beforehand for reproducibility.

set.seed(5)

data_sets = partition(data = data, prob = c(0.8, 0.2))

train_set_A = data_sets$part1
test_set_A = data_sets$part2

actual_test_A  = test_set_A$Diagnosis

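Since the target variable Diagnosis is binary, we validate the partition by checking whether the proportion of Diagnosis = 1 differs between the train and test sets. For this we use a two-sample test of proportions with a significance level of 0.05. Writing \(\pi_1\) and \(\pi_2\) for the proportions of Diagnosis = 1 in the train and test sets, the hypotheses are:

\[ H_0: \pi_1 = \pi_2 \quad \text{versus} \quad H_a: \pi_1 \neq \pi_2 \]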

x1 = sum(train_set_A$Diagnosis == 1)
x2 = sum(test_set_A$Diagnosis == 1)

n1 = nrow(train_set_A)
n2 = nrow(test_set_A)

prop.test(x = c(x1, x2), n = c(n1, n2))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 2.5685, df = 1, p-value = 0.109
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.094033560  0.009761076
## sample estimates:
##    prop 1    prop 2 
## 0.3448884 0.3870246

Since the p-value (0.109) is greater than the alpha level (0.05), we do not reject the null hypothesis: there is not enough evidence of a statistically significant difference in the proportion of Diagnosis = 1 between the train and test sets. Thus, we may proceed with further analysis, as the train set appears to be representative of the test set.

7 Modelling

Considering the theoretical framework and the results of the Exploratory Data Analysis (EDA), the following variables, out of the 32 predictors in the dataset, were found to be associated with the target variable and are deemed important: CholesterolHDL, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems, ADL and SleepQuality. Given these variables and the data partitioning, different algorithms will be used to model their effect on Diagnosis. For this purpose, the following formula is created:

formula = Diagnosis~ CholesterolHDL + MMSE + FunctionalAssessment + MemoryComplaints +BehavioralProblems + ADL + SleepQuality

First, we will apply the kNN algorithm. Second, we will explore Naive Bayes classification. Lastly, we will fit a logistic regression.

7.1 kNN algorithm

The optimal value of k will be chosen based on the Error Rate, using the function kNN.plot().

kNN.plot(formula, train = train_set_A, test = test_set_A, transform = 'minmax',
          k.max = 30, set.seed = 7)

The optimal value of k seems to be k = 25, as it yields the lowest error rate.
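The transform = 'minmax' argument matters because kNN relies on distances: without rescaling, a wide-ranging variable such as CholesterolHDL would dominate a 0/1 variable such as MemoryComplaints. A minimal sketch of min-max scaling in base R follows. The helper name minmax_scale and the example values are hypothetical; they only illustrate the idea and are not the liver package's internal implementation.

```r
# Rescale a numeric vector to the [0, 1] interval
minmax_scale <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

hdl_example <- c(20, 45, 60, 85, 100)   # made-up values on a wide scale
scaled <- minmax_scale(hdl_example)
scaled
```

After scaling, every predictor spans the same range, so each contributes comparably to the distance between two patients.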

7.2 Naive Bayes Classifier

The Naive Bayes Classifier model will be created by calling the naive_bayes() function in R.

train_set_A$Diagnosis <- as.factor(train_set_A$Diagnosis)
test_set_A$Diagnosis <- as.factor(test_set_A$Diagnosis)

naive_bayes = naive_bayes(formula, data = train_set_A)

naive_bayes
## 
## ================================= Naive Bayes ==================================
## 
## Call:
## naive_bayes.formula(formula = formula, data = train_set_A)
## 
## -------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## -------------------------------------------------------------------------------- 
##  
## A priori probabilities: 
## 
##         0         1 
## 0.6551116 0.3448884 
## 
## -------------------------------------------------------------------------------- 
##  
## Tables: 
## 
## -------------------------------------------------------------------------------- 
## :: CholesterolHDL (Gaussian) 
## -------------------------------------------------------------------------------- 
##               
## CholesterolHDL        0        1
##           mean 58.78509 60.32494
##           sd   23.22257 23.11797
## 
## -------------------------------------------------------------------------------- 
## :: MMSE (Gaussian) 
## -------------------------------------------------------------------------------- 
##       
## MMSE           0         1
##   mean 16.027019 11.946981
##   sd    8.921653  7.213687
## 
## -------------------------------------------------------------------------------- 
## :: FunctionalAssessment (Gaussian) 
## -------------------------------------------------------------------------------- 
##                     
## FunctionalAssessment        0        1
##                 mean 5.931812 3.689126
##                 sd   2.793780 2.613157
## 
## -------------------------------------------------------------------------------- 
## :: MemoryComplaints (Gaussian) 
## -------------------------------------------------------------------------------- 
##                 
## MemoryComplaints         0         1
##             mean 0.1192825 0.3611584
##             sd   0.3242661 0.4807460
## 
## -------------------------------------------------------------------------------- 
## :: BehavioralProblems (Gaussian) 
## -------------------------------------------------------------------------------- 
##                   
## BehavioralProblems          0          1
##               mean 0.09596413 0.26575809
##               sd   0.29467421 0.44211279
## 
## --------------------------------------------------------------------------------
## 
## # ... and 2 more tables
## 
## --------------------------------------------------------------------------------
summary(naive_bayes)
## 
## ================================= Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = formula, data = train_set_A) 
## - Laplace: 0 
## - Classes: 2 
## - Samples: 1702 
## - Features: 7 
## - Conditional distributions: 
##     - Gaussian: 7
## - Prior probabilities: 
##     - 0: 0.6551
##     - 1: 0.3449
## 
## --------------------------------------------------------------------------------

7.3 Logistic Regression

In order to conduct a logistic regression analysis, the following model will be used:

\[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7)}} \]

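Here p is the probability of Diagnosis = 1 and x_1 to x_7 are the seven predictors in the formula. The model is created by calling the glm() function in R, and the summary() function is used to analyze the coefficients of the variables and their significance.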

logreg= glm(formula, data = data, family = binomial)

summary(logreg)
## 
## Call:
## glm(formula = formula, family = binomial, data = data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           3.973338   0.370435  10.726   <2e-16 ***
## CholesterolHDL        0.004878   0.002708   1.802   0.0716 .  
## MMSE                 -0.107115   0.008081 -13.256   <2e-16 ***
## FunctionalAssessment -0.445320   0.026087 -17.071   <2e-16 ***
## MemoryComplaints      2.586858   0.165096  15.669   <2e-16 ***
## BehavioralProblems    2.464977   0.180858  13.629   <2e-16 ***
## ADL                  -0.414363   0.025554 -16.215   <2e-16 ***
## SleepQuality         -0.056527   0.035619  -1.587   0.1125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2792.3  on 2148  degrees of freedom
## Residual deviance: 1602.4  on 2141  degrees of freedom
## AIC: 1618.4
## 
## Number of Fisher Scoring iterations: 6

Based on the output, MMSE, FunctionalAssessment, MemoryComplaints, BehavioralProblems and ADL are significant at alpha = 0.001, while CholesterolHDL is significant only at the alpha level of 0.1. SleepQuality has a high p-value (0.1125) and is not significant. Therefore, the removal of this variable could be considered.
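To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the summary above and computes the predicted probability for a hypothetical patient profile (the predictor values are made up for illustration), once without and once with memory complaints. It also converts the MemoryComplaints coefficient into an odds ratio.

```r
# Estimates copied from the summary(logreg) output
beta <- c(
  intercept = 3.973338, CholesterolHDL = 0.004878, MMSE = -0.107115,
  FunctionalAssessment = -0.445320, MemoryComplaints = 2.586858,
  BehavioralProblems = 2.464977, ADL = -0.414363, SleepQuality = -0.056527
)

# Hypothetical patient; the leading 1 multiplies the intercept
patient <- c(intercept = 1, CholesterolHDL = 60, MMSE = 15,
             FunctionalAssessment = 5, MemoryComplaints = 0,
             BehavioralProblems = 0, ADL = 5, SleepQuality = 7)

# plogis() is the logistic function 1 / (1 + exp(-z))
p_without <- plogis(sum(beta * patient))

patient_with <- patient
patient_with["MemoryComplaints"] <- 1
p_with <- plogis(sum(beta * patient_with))

# Odds ratio associated with reporting memory complaints
odds_ratio <- exp(beta["MemoryComplaints"])
c(p_without = p_without, p_with = p_with)
odds_ratio
```

Holding the other predictors fixed, reporting memory complaints multiplies the estimated odds of a diagnosis by roughly 13, which is why the predicted probability for this illustrative profile moves from below 0.5 to above 0.5.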

7.3.1 Regression model verification

After the model is created, it is important to check whether it satisfies the modeling assumptions. Here, multicollinearity is checked with car::vif().

car::vif(logreg)
##       CholesterolHDL                 MMSE FunctionalAssessment 
##             1.004962             1.192945             1.295570 
##     MemoryComplaints   BehavioralProblems                  ADL 
##             1.252231             1.258978             1.300856 
##         SleepQuality 
##             1.001082

None of the VIF values exceeds 5 (all are close to 1), indicating no problematic multicollinearity among the predictors. Thus, this assumption is not violated.

8 Model Evaluation

To evaluate the predictive models, which have a binary target variable, the confusion matrix, ROC curve and AUC will be used.

8.1 Evaluation of kNN Algorithm

8.1.1 Confusion Matrix

predict_knn_25_trans = kNN(formula, train = train_set_A, test = test_set_A, transform = "minmax", k = 25)
conf.mat.plot(predict_knn_25_trans, actual_test_A)

This confusion matrix demonstrates that the algorithm and the chosen predictor variables are quite effective in predicting Alzheimer’s. Sensitivity is equal to 152/173 = 0.879, meaning that 87.9% of the patients who actually have an Alzheimer’s diagnosis were correctly identified by the model. Specificity equals 255/274 = 0.931, meaning that the model correctly identifies 93.1% of the negative class (no Alzheimer’s diagnosis). This suggests that the model is very accurate in its predictions.

8.1.2 ROC Curve and AUC

prob_knn = kNN(formula, train = train_set_A, test = test_set_A, transform = "minmax", k = 25, type = "prob")[, 1]
roc_knn = roc(actual_test_A, prob_knn)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
ggroc(roc_knn, size = 0.8) +
  theme_minimal() + 
  ggtitle(paste("ROC plot for kNN; AUC =", round(auc(roc_knn), 3))) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "inside", text = element_text(size = 17))

The ROC curve and the AUC of 0.936 suggest that kNN performs well in predicting our data.
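To make the meaning of the AUC concrete, the sketch below computes it with the rank-based (Mann-Whitney) formula: the AUC equals the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function name auc_rank and the toy vectors are hypothetical and purely illustrative; they are not part of the analysis and do not come from the Alzheimer’s dataset. The analysis itself relies on pROC's roc() and auc().

```r
# Rank-based (Mann-Whitney) AUC: the share of case-control pairs in which
# the case has the higher score (ties count as one half via average ranks).
auc_rank <- function(labels, scores) {
  r  <- rank(scores)          # average ranks, so ties are handled
  n1 <- sum(labels == 1)      # number of cases
  n0 <- sum(labels == 0)      # number of controls
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up toy data for illustration only (NOT from the Alzheimer's dataset)
toy_labels <- c(0, 0, 1, 1, 0, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70)

# 8 of the 9 case-control pairs are ordered correctly, so the AUC is 8/9
auc_rank(toy_labels, toy_scores)
```

Note that this sketch assumes higher scores indicate the positive class. In the kNN code above, the scores passed to roc() are the first column of the probability matrix, and pROC reports "controls > cases", meaning it detected the reverse direction automatically.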

8.2 Evaluation of Naive Bayes Classifier

8.2.1 Confusion Matrix

prob_naive_bayes = predict(naive_bayes, test_set_A, type = "prob")[, 1]

conf.mat(prob_naive_bayes, actual_test_A, cutoff = 0.5, reference = "0")
##        Actual
## Predict   0   1
##       0 242  48
##       1  32 125
conf.mat.plot(prob_naive_bayes, actual_test_A, cutoff = 0.5, reference = "0")

There are 125 True Positives and 242 True Negatives. However, there are 48 False Negatives and 32 False Positives. This implies that sensitivity and specificity are 125/173 = 0.72 and 242/274 = 0.88, respectively. The predictive accuracy of the model is therefore relatively good, although it already appears to be worse than that of kNN.

8.2.2 ROC Curve and AUC

prob_naive_bayes = predict(naive_bayes, test_set_A, type = "prob")[, 1]
roc_naive_bayes = roc(actual_test_A, prob_naive_bayes)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
ggroc(roc_naive_bayes, size = 0.8) +
  theme_minimal() + 
  ggtitle(paste("ROC plot for Naive Bayes; AUC =", round(auc(roc_naive_bayes), 3))) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "inside", text = element_text(size = 17))

The ROC curve and the AUC of 0.934 suggest that the Naive Bayes Classifier performs well.

8.3 Evaluation of Logistic Regression

8.3.1 Confusion Matrix

prob_logreg <- predict(logreg, test_set_A, type = "response")

conf.mat.plot(prob_logreg, actual_test_A, cutoff = 0.5, reference = "1")

This confusion matrix demonstrates that logistic regression with the chosen predictor variables is also effective in predicting Alzheimer’s. Sensitivity is equal to 135/173 = 0.78, meaning that 78% of the patients actually diagnosed with Alzheimer’s were correctly identified by the model. Specificity equals 249/274 = 0.91, meaning that the model correctly identifies 91% of the negative class (no Alzheimer’s diagnosis). This suggests that the model is rather accurate in its predictions.

8.3.2 ROC Curve and AUC

prob_logreg <- predict(logreg, test_set_A, type = "response")
roc_logreg_1 = roc(actual_test_A, prob_logreg)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(roc_logreg_1, size = 0.8) +
  theme_minimal() + 
  ggtitle(paste("ROC plot for Logistic Regression; AUC =", round(auc(roc_logreg_1), 3))) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "inside", text = element_text(size = 17))

The ROC curve and the AUC of 0.919 suggest that Logistic Regression performs well in this context.

8.4 Comparison of the models

ggroc(list(roc_naive_bayes, roc_knn, roc_logreg_1), size = 0.8) +
  theme_minimal() + 
  ggtitle("ROC plots with their AUC values") +
  scale_color_manual(values = 1:3, 
                     labels = c(
                       paste("Bayes; AUC=", round(auc(roc_naive_bayes), 3)),
                       paste("KNN; AUC=", round(auc(roc_knn), 3)),
                       paste("Logistic Regression; AUC=", round(auc(roc_logreg_1), 3))
                     )) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.7, 0.3), text = element_text(size = 17))

The AUC values range from 0.919 to 0.936, indicating that all three models have very good discriminatory power. Nevertheless, kNN slightly outperforms the other models, with an AUC equal to 0.936.

Moreover, when comparing the confusion matrices of the three algorithms (kNN, Naive Bayes Classifier, and Logistic Regression), it is evident that kNN is the most accurate, with a sensitivity of 87.9% and a specificity of 93.1%. The other two algorithms, although still good, could not predict Alzheimer’s as accurately as kNN. Therefore, kNN is the best-performing model for predicting Alzheimer’s in this case.
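The comparison can be summarized in a single table. The sketch below rebuilds the three confusion matrices from the counts reported in Sections 8.1 to 8.3 and derives sensitivity, specificity, and overall accuracy for each model. The object names counts and metrics are hypothetical, and the false negative and false positive counts for kNN and logistic regression are derived from the reported denominators (173 actual positives, 274 actual negatives) rather than printed in the text.

```r
# Confusion-matrix counts as reported in Sections 8.1-8.3.
# FN and FP for kNN and logistic regression are derived from the reported
# totals of 173 actual positives and 274 actual negatives.
counts <- data.frame(
  model = c("kNN (k = 25)", "Naive Bayes", "Logistic Regression"),
  TP = c(152, 125, 135),
  FN = c(21, 48, 38),
  TN = c(255, 242, 249),
  FP = c(19, 32, 25)
)

# Derive the evaluation metrics column-wise for all three models
metrics <- transform(
  counts,
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  accuracy    = (TP + TN) / (TP + FN + TN + FP)
)

# Print a compact, rounded comparison table
print(cbind(
  metrics["model"],
  round(metrics[c("sensitivity", "specificity", "accuracy")], 3)
))
```

Under these counts, kNN has the highest value on all three metrics, which is consistent with the AUC ranking above.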

8.5 Hypotheses evaluation

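Our model identified several key predictors for Alzheimer’s diagnosis: Cholesterol HDL, MMSE, Functional Assessment, Memory Complaints, Behavioral Problems, ADL, and Sleep Quality. These variables were found to have the strongest associations with Alzheimer’s diagnosis, although Cholesterol HDL was significant only at the 0.1 level and Sleep Quality (p = 0.1125) did not reach statistical significance in the final logistic regression model.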

This exploratory research consists of six main hypotheses regarding the capabilities of different categories of variables to predict an Alzheimer’s diagnosis. Given the findings of the individual comparative statistics and the predictive models, these hypotheses are reviewed and assessed below.

H1 proposed that older age would be associated with a higher likelihood of Alzheimer’s diagnosis. The results of the data analysis revealed that age is not statistically significant in determining whether an individual has Alzheimer’s or not. Based on this finding, there is insufficient support for H1 and so we do not accept it.

H2 suggested that a healthier lifestyle, characterized by a lower BMI, non-smoking status, low alcohol consumption, regular physical activity, good diet quality, and better sleep quality, would be associated with a lower likelihood of Alzheimer’s diagnosis. Sleep quality was the only lifestyle factor retained in the final model, although its p-value (0.1125) lies above the conventional significance levels; the other factors were not found to be statistically significant. Thus, some effect for H2 was found and it can be partially accepted, albeit with caution.

H3 stated that a history of chronic health conditions such as cardiovascular disease, diabetes, depression, hypertension, and head injury, as well as a family history of Alzheimer’s, would increase the likelihood of an Alzheimer’s diagnosis. None of these factors were found to be statistically significant in determining an Alzheimer’s diagnosis. Thus, H3 cannot be accepted.

H4 suggested that poor cardiovascular health, indicated by variables such as high blood pressure and unfavorable cholesterol levels (high total cholesterol, high LDL, low HDL, and high triglycerides), would be associated with a higher likelihood of Alzheimer’s diagnosis. CholesterolHDL was found to be a significant predictor (at the 0.1 level) of whether or not a person would be diagnosed with Alzheimer’s, whereas the other indicators were not. Thus, H4 can be partially accepted.

H5 proposed that lower scores on cognitive and functional assessments and the presence of memory and behavioral complaints would result in a higher likelihood of Alzheimer’s diagnosis. The data analysis revealed that MMSE scores, FunctionalAssessment scores, MemoryComplaints, BehavioralProblems, and ADL scores were all statistically significant predictors of an Alzheimer’s diagnosis. This indicates full support for H5 and so it is accepted.

Finally, H6 suggested that the presence of cognitive and behavioral symptoms such as confusion, disorientation, and forgetfulness would be positively associated with Alzheimer’s diagnosis. Within this category, none of the variables was determined to be statistically significant. Thus, H6 is not accepted.

8.5.1 Conclusion

Given the challenges surrounding the diagnosis and treatment of Alzheimer’s, there has long been difficulty in assessing and predicting an Alzheimer’s diagnosis. As demonstrated in previous literature, many of the factors that lead to or indicate Alzheimer’s can be difficult to notice. Thus, this research is important as it demonstrates how data can be used to determine which factors are the best predictors of the disease, and that the models could also be utilized in a medical setting. That said, all of the many factors that may indicate Alzheimer’s must be tested rigorously to ensure that the proper ones are selected to improve the quality of healthcare outcomes for current and future patients.

Our analysis revealed predictors, including Cholesterol HDL, MMSE scores, Functional Assessment, Memory Complaints, Behavioral Problems, ADL scores, and Sleep Quality, that were associated with Alzheimer’s diagnosis, thereby fulfilling our research objective. These findings underscore the importance of both cognitive assessments and lifestyle factors in identifying individuals at higher risk for Alzheimer’s. Considering the objective and the methodology of the research, the results of the data analysis suggest that some predictors might be especially important in the prediction of Alzheimer’s.

Age, for example, was originally expected to be a very significant indicator of Alzheimer’s diagnosis; however, the findings of this research did not provide enough support to accept this hypothesis drawn from previous literature. Given the nature of the data, this finding is likely due to the fact that the dataset did not contain a wide range of ages but consisted mostly of elderly people, meaning there is not necessarily a contradiction with pre-existing literature.

Regarding lifestyle factors, pre-existing literature suggested that a healthier lifestyle would be protective against Alzheimer’s. The model found that sleep quality was the most important factor within the category of lifestyle, although its p-value of 0.1125 means that this result should be interpreted with caution. This highlights the potential importance of sleep quality in relation to cognitive health. For chronic health conditions, despite the widespread belief that conditions like cardiovascular disease, diabetes, and depression increase Alzheimer’s risk, the analysis found no significant associations for these variables. This may reflect the fact that the differences between individuals with and without these conditions were not substantial enough to influence the model’s predictions.

In contrast, two categories which did have significant effects on predicting Alzheimer’s diagnosis were clinical measurements and cognitive/functional assessments. Cholesterol was suggested in previous literature to have a negative effect on cognitive health, and this was supported by the findings of the model, which suggested that higher levels of Cholesterol HDL were positively associated with Alzheimer’s diagnosis. As for the cognitive and functional assessments, these are very clearly significant in the prediction of Alzheimer’s diagnosis. This corroborates the findings of previous research, which demonstrated that people with Alzheimer’s performed far worse on similar tests than those without Alzheimer’s.

The results of our study offer several actionable insights for both clinical practice and research.

First, preventive strategies should focus on addressing the modifiable risk factors that emerged in our analysis. More specifically, improving sleep quality should be a focus point in the efforts to prevent Alzheimer’s, given the link between poor sleep quality and the increased risk for Alzheimer’s suggested by our model. Healthcare providers should also take more action in monitoring and improving a person’s sleep quality. This could be implemented through sleep therapy to enhance a person’s sleep quality and patterns. Another important aspect of preventing Alzheimer’s is cholesterol management. Our study indicates that the higher the HDL cholesterol levels are, the higher the risk of being diagnosed with Alzheimer’s. As such, healthcare providers should regularly monitor HDL levels and intervene when necessary.

Second, early detection is incredibly important in the case of Alzheimer’s disease; thus, the most reliable predictors identified in this study should be taken into account. Cognitive and functional assessments such as the MMSE are crucial here, and so should remain an important part of screening protocols and be regularly used in high-risk populations. Two other significant predictors are Behavioral Problems and Memory Complaints. These symptoms are also an important part of the screening for early signs of Alzheimer’s. An appropriate protocol to monitor these predictors is an essential part of the early detection of Alzheimer’s, and the use of these predictors could improve early diagnosis rates.

Third, patient care should be improved. A key predictor we found in our study is the activities of daily living (ADL) score. Thus, maintaining a patient’s independence in daily tasks for as long as possible is vital. Encouraging patients to develop consistent habits around daily tasks can extend their independence. Where necessary, a tailored care plan should be introduced to support patients in maintaining these activities for as long as possible.

Future research into Alzheimer’s determinants and potential prevention measures is still needed. Our research revealed some patterns and trends that can be tested further, but a lot is still unknown. For example, it remains unclear how long-term improvement in these areas affects the risk of being diagnosed with Alzheimer’s, which is a potential area for future research.

9 Reference list

Alzheimer’s Association. (2024). 2024 Alzheimer’s Disease Facts and Figures. Alzheimer’s Association. https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf

Breijyeh, Z., & Karaman, R. (2020). Comprehensive Review on Alzheimer’s Disease: Causes and Treatment. Molecules, 25(24), 5789. https://doi.org/10.3390/molecules25245789

El Kharoua, R. (2024). Alzheimer’s Disease Dataset. Kaggle.com. https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset/data?select=alzheimers_disease_data.csv

Suh, G., Ju, Y., Yeon, B. K., & Shah, A. (2004). A longitudinal study of Alzheimer’s disease: rates of cognitive and functional decline. International Journal of Geriatric Psychiatry, 19(9), 817–824. https://doi.org/10.1002/gps.1168